Robust indexing and retrieval of electronic ink

ABSTRACT

A unique system and method that facilitates indexing and retrieving electronic ink objects with improved efficiency and accuracy is provided. Handwritten words or characters are mapped to a low dimension through a process of segmentation, stroke classification using a neural network, and projection along directions found using OPCA, for example. The employment of OPCA makes these low dimensional representations robust to handwriting variations or noise. Each handwritten word or set of characters is stored along with a neighborhood hyperrectangle that represents word variations. Redundant bit vectors are used to index the hyperrectangles for efficient storage and retrieval. Ink-based queries can be submitted in order to retrieve at least one ink object. To do so, the ink query is processed to determine its query point, which is represented by a (query) hyperrectangle. A data store can then be searched for any stored hyperrectangles that match the query hyperrectangle.

BACKGROUND

Pen based computers and personal digital assistants (PDAs) offer a more natural means of input and are becoming increasingly common for capturing handwriting, annotation, sketching, and other types of freeform input. The growing number of ink documents, both on personal computers and the internet, has created a need for efficient indexing and retrieval of ink documents. Unfortunately, traditional search and retrieval systems lack such capabilities or are too slow and cumbersome for practical anytime use, particularly when searching through more than a few hundred ink documents.

Searching in handwritten cursive text is a challenging problem. The same word written by different persons can look very different. Further, multiple instances of the same word written by the same person are similar but not identical. In general, instance variations (for the same person) are smaller than inter-person variations (for the same word), which in turn are smaller than inter-word variations. Thus, to be effective, electronic ink retrieval strategies must be robust to such handwriting variations.

Current approaches for ink retrieval fall into three categories: a) recognition based, b) template matching based, or c) shape matching based. Recognition based retrieval methods combine the output of a handwriting recognizer with approximate string matching algorithms to retrieve similar ink. They are quite robust to handwriting styles and generalize well over both print and cursive writing. However, they have high computational requirements, as the recognizer needs to be run once on every entry in the data store during indexing. Moreover, during the recognition process much information about the particular shape of the letters (allographs), writing style, etc. is lost. Good retrieval rates require that the complete recognition lattice be stored and used during matching, which adds significantly to the memory overhead. Furthermore, since handwriting recognition is heavily dependent on the associated lexicon, recognition based retrieval approaches are inherently limited by the accuracy and applicability of the lexicon used. With a good lexicon, state of the art handwriting recognizers achieve 80-90% word recognition accuracy. However, as soon as the lexicon is removed, word accuracy drops to 60-70%. In addition, such recognition based retrieval techniques often break down for arbitrary ink input such as pen gestures and line graphics such as flow charts, hand drawn maps, etc.

Unlike other forms of shape data, electronic ink has time information that can be effectively used in template matching algorithms. Dynamic time warp (DTW) is the most prevalent matching technique for producing reliable similarity scores between handwritten words. DTW is very accurate, but is an O(n²) algorithm, where n is the number of points. As a result, it is computationally prohibitive for use in retrieval applications involving large databases of words. DTW based template matching techniques are well suited for searching through all words in a single page or document and are commonly used for building find-and-replace type features supported by modern ink capable document editors and word processors.

Shape matching algorithms decompose the input shape into a bag of shape features. Two shapes are compared for similarity based on the minimum cost of matching features from one shape to the features of the other. The lower the matching cost, the better the match (with zero as a lower bound). Unlike DTW matching, the shape features typically discard all time information. Unfortunately, as in the case of DTW, computing the optimal matching for a single shape comparison has a complexity that is super-polynomial in the number of features. Thus, shape matching algorithms have been effective over small database sizes but impractical over anything larger than a few hundred words.

All of the above approaches for ink retrieval rely on a linear scan through the database for each query, which tends to be slow. Sequential evaluation combined with early termination is commonly employed while computing match scores to avoid long query times.

SUMMARY

The following presents a simplified summary in order to provide a basic understanding of some aspects of the systems and/or methods discussed herein. This summary is not an extensive overview of the systems and/or methods discussed herein. It is not intended to identify key/critical elements or to delineate the scope of such systems and/or methods. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

The subject application relates to a system(s) and/or methodology that provide a new approach for indexing and retrieving handwritten words or symbols captured using an electronic pen or tablet. In particular, handwritten words (cursive or print) are first segmented and featurized using a neural network. The neural network can serve as a classifier to classify the one or more segments. These features are indexed using a hashing scheme that is both locality sensitive as well as robust to input noise. Oriented principal component analysis (OPCA) can be used to map these features into a low dimensional space, thereby facilitating robustness to handwriting variation (noise). Redundant bit vectors are used to index the resulting low dimensional representations for efficient storage and retrieval.

More specifically, an ink word or input ink can be initially normalized and segmented and then can be mapped to a membership matrix using a neural network. The membership matrix can then be projected into a low dimensional space defined by at least a subset of OPCA directions (e.g., the first 32 directions). Through this process, each input ink (word) becomes a point in the low dimensional space. Distorted versions of the input ink can also be represented as points in the low dimensional space. The combination of computations involving Chebyshev polynomials, classification processing by the neural network, and the directions found using OPCA causes these points for the same word (or input ink) to cluster together closely. The input ink along with its variations can be represented by an enclosing hyperrectangle (or hypersphere). Using OPCA causes these hyperrectangles to be small; thus, they can be readily and efficiently stored and/or indexed while the database is built to accommodate subsequent ink search and retrieval.

To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed, and the subject invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention may become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high level block diagram of an indexing system that facilitates more efficient retrieval of ink objects or items.

FIG. 2 is a block diagram of an indexing system that facilitates more efficient retrieval of ink objects or items.

FIG. 3 is a diagram that demonstrates an example of segmentation whereupon the word cursive has been split into 12 segments.

FIG. 4 is a diagram of an exemplary membership matrix for the ink word John.

FIG. 5 is a diagram of ink samples of the input ink word Text written by the same user.

FIG. 6 is a diagram of distorted ink samples for the input ink word John.

FIG. 7 is a graphical representation of a generalized eigenvalue spread over OPCA projections.

FIG. 8 illustrates query space partitioning and bit vector generation for one dimension as performed in order to index electronic ink using redundant bit vectors.

FIG. 9 is a block diagram of an ink retrieval system that can be employed in conjunction with the systems of FIGS. 1 and/or 2, above, to obtain query results based on query ink.

FIG. 10 illustrates a graphical representation of experimental data results obtained when applying the system of FIGS. 1 and/or 2 on a US-Natural dataset.

FIG. 11 illustrates a graphical representation of experimental data results obtained when applying the system of FIGS. 1 and/or 2 on a UNIPEN dataset.

FIG. 12 illustrates a graphical representation of experimental data results obtained when applying the system of FIGS. 1 and/or 2 on a US-Natural dataset.

FIG. 13 illustrates a graphical representation of experimental data results obtained when applying the system of FIGS. 1 and/or 2 on a UNIPEN dataset.

FIG. 14 is a flow diagram illustrating an exemplary methodology that facilitates indexing and retrieval of electronic ink using redundant bit vectors.

FIG. 15 is a flow diagram illustrating an exemplary methodology that facilitates indexing and retrieval of electronic ink using redundant bit vectors.

FIG. 16 is a flow diagram illustrating an exemplary methodology that facilitates retrieving electronic ink from data store(s) in which the ink has been indexed using redundant bit vectors.

FIG. 17 illustrates an exemplary environment for implementing various aspects of the invention.

DETAILED DESCRIPTION

The subject systems and/or methods are now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the systems and/or methods. It may be evident, however, that the subject systems and/or methods may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing them.

As used herein, the terms “component” and “system” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.

Unlike previous approaches to ink retrieval, the subject system and method can be effectively employed to search over very large collections of ink documents or objects obtained from hundreds of users with several hundreds of thousands to millions of handwritten words. Query times can be significantly reduced or minimized while achieving high accuracy in the results returned. The subject system and method has been evaluated on a large database containing 898,652 handwritten words. The experimental data will be described later with respect to FIGS. 10-13. In direct contrast, such previous approaches tend to only be effective on relatively small databases of a few hundred words. They may be well-suited for find-and-replace type applications that work with a single ink document but tend to be too slow and are overall ineffective on a larger, more realistic scale.

Referring now to FIG. 1, there is a high level block diagram of an indexing system 100 that can be employed to construct one or more indices for ink content in order to facilitate the retrieval of ink objects such as documents. The system 100 can include a featurization component 110 that featurizes segmented input ink to yield a membership matrix for each input ink. This featurization process can involve segmenting the input ink and computing stroke features for each segment. The featurized segments can then be classified by a neural network in order to reduce or mitigate entropy/variance in the stroke features. The neural network can return stroke or segment membership scores for each segment of the input ink and can learn the stroke memberships. Furthermore, the neural network computations can be used to form a membership matrix for the particular input ink.

The membership matrix can be communicated to a processor component 120 that projects the membership matrix into a low dimensional space. This is performed in order to make the system 100 robust to variations that can occur in the same word written by the same user. The low dimensional space can be defined by projections determined by linear discriminant analysis (LDA) or OPCA, for example.

In addition, a number of distorted versions of the input ink can be created and indexed with the undistorted version. The input ink as well as its distorted counterparts can be represented by hyperrectangles by the indexing component 130 and then stored in one or more indices or a data store. When querying the data store for an ink document (e.g., its hyperrectangle form), a nearest neighbor search can be performed using a brute force linear scan through all stored hyperrectangles to look for any overlap. For large data stores, this scanning process can be relatively slow and impracticable, especially if employing a conventional approach. However, the subject indexing component 130 indexes the hyperrectangles using redundant bit vectors, which effectively speeds up the nearest neighbor search. A more detailed discussion of this can be found below in the following figures.

Turning now to FIG. 2, there is a block diagram of an indexing system 200 that can be employed to construct one or more indices for electronic ink in order to facilitate the retrieval of such ink. The system 200 includes a pre-processing module 210 that normalizes handwritten input ink. In particular, the pre-processing module 210 can smooth, center, and/or scale the input ink to a unit height while preserving its aspect ratio. The normalization process makes ink retrieval immune to translation and scale variations.
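
By way of a non-limiting illustration, the following Python sketch shows one possible realization of this normalization step: it smooths, centers, and scales an array of (x, y) ink points to unit height while preserving the aspect ratio. The point-array representation and the moving-average smoothing window are assumptions made for the example and are not prescribed by the system described above.

```python
import numpy as np

def normalize_ink(points, smooth_window=3):
    """Smooth, center, and scale ink points to unit height while preserving
    the aspect ratio (a sketch; the exact smoothing used by the described
    pre-processing module is not specified)."""
    pts = np.array(points, dtype=float)            # shape (n, 2): columns are x, y
    # Simple moving-average smoothing along the stroke.
    kernel = np.ones(smooth_window) / smooth_window
    pts[:, 0] = np.convolve(pts[:, 0], kernel, mode="same")
    pts[:, 1] = np.convolve(pts[:, 1], kernel, mode="same")
    # Center at the origin.
    pts -= pts.mean(axis=0)
    # Scale so the vertical span (word height) is 1; x is scaled by the same
    # factor, which preserves the aspect ratio.
    height = pts[:, 1].max() - pts[:, 1].min()
    if height > 0:
        pts /= height
    return pts
```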

The normalized input ink can then be split into segments by a segmentation module 220. The segmentation module can cut the input ink at the bottoms of the characters. FIG. 3 presents a segmentation example of the word cursive 300, which has been written in cursive in two strokes (one for the word and one for the dot). Segmentation takes place where the y coordinate reaches a minimum value and starts to increase (culling points are noted as the dots in 310). The segments 320 are then sorted by the x-coordinate of their left-most point. Each of the segments 320 can then be represented in the form of a Chebyshev polynomial. The coefficients of the truncated Chebyshev polynomial become the segment features. The segment features are invertible and can be used to reconstruct approximate versions of the source segments.
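
The segmentation and featurization just described can be illustrated with the following Python sketch, which cuts a stroke at local y minima, sorts the segments by their left-most x coordinate, and takes truncated Chebyshev coefficients of each segment as its features. The polynomial degree and the exact local-minimum test are assumed values chosen for the example.

```python
import numpy as np
from numpy.polynomial import chebyshev

def segment_at_y_minima(pts):
    """Cut a stroke where the y coordinate reaches a local minimum and then
    starts to increase (a sketch of the cut rule described above)."""
    y = pts[:, 1]
    cuts = [0]
    for i in range(1, len(y) - 1):
        if y[i] <= y[i - 1] and y[i] < y[i + 1]:   # local minimum
            cuts.append(i)
    cuts.append(len(pts))
    segments = [pts[a:b + 1] for a, b in zip(cuts[:-1], cuts[1:]) if b - a >= 1]
    # Sort segments by the x coordinate of their left-most point.
    return sorted(segments, key=lambda s: s[:, 0].min())

def chebyshev_features(segment, degree=5):
    """Fit truncated Chebyshev polynomials to x(t) and y(t) over the segment
    and return the coefficients as segment features (degree is assumed)."""
    t = np.linspace(-1.0, 1.0, len(segment))
    cx = chebyshev.chebfit(t, segment[:, 0], degree)
    cy = chebyshev.chebfit(t, segment[:, 1], degree)
    return np.concatenate([cx, cy])
```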

Referring again to FIG. 2, the featurized segments can then be processed or classified by a classification component 230. The classification component 230 can employ a neural network, such as a time delay neural network (TDNN), to produce segment membership scores for each segment. Using the shape of the segment, one can determine the probability of the segment belonging to each possible character in the language. These probabilities are captured by the segment membership scores. For example, segment 3 in FIG. 3 can belong to any of the following characters: i, u, w, a, etc. The neural network examines each segment and returns a vector of character membership scores.

The character membership vector has two entries for each supported character: one to identify the beginning of a character (begin-membership), and the other to mark the continuation of the character (continuation-membership). For example, segment 3 in FIG. 3 would have a high continuation-membership score for the letter a and a low begin-membership. The TDNN takes three consecutive segments as input and outputs the segment membership scores for the middle segment. The previous and next segments are included to provide context that helps in reducing ambiguity. For the first and last segments, zero inputs are used in place of the missing previous and next segments, respectively. Having two membership entries for each character and providing stroke context as input makes it easier for the TDNN to learn stroke memberships.

The TDNN outputs are collected as consecutive columns to form the membership matrix. The membership matrix is normalized (to have the same size for all words) by padding with zero columns to the right. Each character produces a few segments, and each segment has only a few non-zero membership entries. As a result, though the membership matrix appears large, it is typically very sparse, with the number of non-zero entries being a small multiple of the number of characters in the word. Note that a membership matrix can be computed in a number of ways: a different classifier can be substituted for the TDNN; the rows of the matrix can correspond to allographs rather than characters; or the columns can represent spatial regions rather than segments.
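
A minimal sketch of assembling such a membership matrix is given below. The classify callable stands in for the TDNN (it is a hypothetical placeholder, not the actual network), and the matrix width and character-set size are assumed values; the padding with zero columns follows the normalization described above.

```python
import numpy as np

def membership_matrix(segment_features, classify, num_chars=26, max_segments=40):
    """Build a membership matrix whose columns are per-segment membership
    vectors (two rows per character: begin- and continuation-membership),
    padded on the right with zero columns to a fixed width.

    `classify(prev, cur, nxt)` is a stand-in for the TDNN: it takes the
    features of three consecutive segments and returns a vector of length
    2 * num_chars. `num_chars` and `max_segments` are assumed values.
    """
    rows = 2 * num_chars
    matrix = np.zeros((rows, max_segments))
    zero = np.zeros_like(segment_features[0])      # zero input for missing context
    for i, cur in enumerate(segment_features[:max_segments]):
        prev = segment_features[i - 1] if i > 0 else zero
        nxt = segment_features[i + 1] if i + 1 < len(segment_features) else zero
        matrix[:, i] = classify(prev, cur, nxt)
    return matrix
```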

FIG. 4 displays the membership matrix 400 for a sample ink word John with the all-zero rows and columns omitted. The first segment is clearly rejected (all memberships are zero). Though the segments for J, o, and n are quite unambiguous, the segments for h show non-zero membership values for h, k, m, n, and r, as expected. The segment classification step significantly reduces the entropy present in the stroke features. This type of TDNN is very commonly used as a front end in handwriting recognition systems for stroke featurization. It is trained using back-propagation and word labels. It is important to note that for indexing ink, a high stroke recognition rate is not necessary since the purpose of the TDNN is the reduction of entropy in the stroke features. Even a poor recognizer can be sufficiently useful for indexing.

Handwritten electronic ink presents a challenge to many retrieval systems because of the variations that can occur in the writing of the same word, character, or symbol by the same user. In FIG. 5, multiple instances of the same word Text have been written by the same user. As can be seen, the instances are similar but not identical. Unlike traditional ink retrieval systems, the subject system 200 in FIG. 2 can be robust to such variations, in part through its employment of a reduction component 240.

Initially, an ink distortion model is used to generate a large number of distorted copies (typically 100) for each word during indexing. A sequence of noise samples from a two dimensional Gaussian random variable is low pass filtered to generate the distortion field. Low pass filtering is achieved by convolving the noise sequence with a standard Gaussian. FIG. 6 shows some distorted samples produced using this approach. A low distortion level was used. The noise magnitude was arbitrarily chosen to be 1/500th (0.2%) of the vertical span of the word (word height). This corresponds to a standard deviation of 0.002 with the input ink being normalized to have unit height. A better approach may be to estimate and use the actual variance present in multiple word instances (e.g., FIG. 5). These variances are likely to vary from word to word and also individual to individual, so the selection of noise parameters should be performed with care.
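
A rough sketch of this distortion model follows: Gaussian noise is drawn per point, low pass filtered by convolution with a Gaussian kernel, and added to the normalized ink. The kernel width is an assumed parameter; the 0.002 standard deviation matches the setting mentioned above.

```python
import numpy as np

def distort_ink(pts, noise_std=0.002, kernel_sigma=2.0, rng=None):
    """Produce one distorted copy of normalized ink (unit height) by adding a
    low-pass-filtered Gaussian noise field, per the distortion model above.
    The Gaussian kernel width (kernel_sigma) is an assumed value."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(pts)
    noise = rng.normal(scale=noise_std, size=(n, 2))
    # Low pass filter: convolve each coordinate's noise sequence with a
    # Gaussian kernel.
    half = int(3 * kernel_sigma)
    t = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (t / kernel_sigma) ** 2)
    kernel /= kernel.sum()
    field = np.column_stack([np.convolve(noise[:, d], kernel, mode="same")
                             for d in range(2)])
    return pts + field

# Typically on the order of 100 distorted copies are generated per word
# during indexing.
```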

Membership matrices can be created for the distorted versions of the input ink. The membership matrices for the undistorted input ink and its distorted versions can be projected into a low dimensional space defined by projection directions determined by OPCA by way of the reduction component 240. OPCA can be used to find projection directions for the membership matrix that maximize signal variance and minimize noise variance. The first few projections with the largest eigenvalues capture most of the variance and can be used to build a low dimensional representation of the membership matrix that is well suited for indexing.

In particular, suppose we are given a set of vectors x_i ∈ R^d, i=1, . . . , m, where each x_i represents a signal (undistorted sample). Suppose that each x_i has a set of N distorted versions x_i^k, k=1, . . . , N. For indexing ink, the x_i and x_i^k are membership matrices for undistorted ink words and their distorted versions. It should be appreciated that the membership matrix can be converted into a vector by sequentially concatenating its columns from left to right.

Let z_i^k be the corresponding noise vectors, defined by z_i^k = x_i^k − x_i. OPCA tries to find linear projections which are as orthogonal as possible to the z_i^k for all k, but along which the variance of the original signal x_i is simultaneously maximized. The OPCA directions are defined as the directions u that maximize the generalized Rayleigh quotient

$q = \frac{u^{\prime} \Sigma_x u}{u^{\prime} \Sigma_z u} \qquad (1)$

where Σ_x and Σ_z are the covariance matrices for the signal and noise vectors. This modified version uses correlation matrices rather than covariance matrices, since the mean noise signal as well as its variance should be penalized. Explicitly, Σ_x and Σ_z can be computed using

$\Sigma_x = \frac{1}{m} \sum_{i} \left( x_i - E[x] \right) \left( x_i - E[x] \right)^{\prime} \qquad (2)$

$\Sigma_z = \frac{1}{mN} \sum_{i,k} \left( z_i^k \right) \left( z_i^k \right)^{\prime} \qquad (3)$

The optimal directions u can be found by solving the generalized eigenvalue problem

$\Sigma_x u = q \, \Sigma_z u \qquad (4)$

to obtain (u_i, q_i) = (eigenvector, eigenvalue) pairs. Most standard numerical linear algebra packages can solve such generalized eigenvalue problems. As in principal component analysis (PCA), one can find a low dimensional representation by selecting the first few (say p) eigenvectors with the largest eigenvalues for projection. FIG. 7 presents example eigenvalue spreads for ink data from the two datasets that were used in experiments. Unlike PCA, the directions found using OPCA are not necessarily orthogonal. However, similar words that have similar membership matrices will have similar projected vectors. Similarity here implies a small Euclidean (L₂) distance between the vectors. Mathematically, the projected representation of a membership matrix is given by

$y_i = u_i^{\prime} x, \quad i = 1, \ldots, p \qquad (5)$

where x is a column vector representation of a membership matrix (signal vector). For indexing ink, the experimental data indicated that the first 32 OPCA directions (e.g., p=32) are sufficient (see FIG. 7). However, it should be appreciated that the number of directions employed can be greater or less than 32.
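
For illustration, the generalized eigenvalue problem of equation (4) can be solved with a standard numerical package. The sketch below uses scipy.linalg.eigh; the small ridge term added to the noise matrix is an extra numerical safeguard that is not part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def opca_directions(signals, distorted, p=32, ridge=1e-6):
    """Compute OPCA projection directions from signal vectors x_i (rows of
    `signals`, shape (m, d)) and their distorted versions x_i^k
    (`distorted`, shape (m, N, d)), following equations (1)-(4)."""
    m, d = signals.shape
    # Signal covariance (equation 2).
    centered = signals - signals.mean(axis=0)
    sigma_x = centered.T @ centered / m
    # Noise correlation (equation 3): z_i^k = x_i^k - x_i, not mean-subtracted.
    z = (distorted - signals[:, None, :]).reshape(-1, d)
    sigma_z = z.T @ z / len(z)
    sigma_z += ridge * np.eye(d)                   # numerical safeguard (assumed)
    # Generalized eigenvalue problem sigma_x u = q sigma_z u (equation 4);
    # eigh returns eigenvalues in ascending order, so keep the largest p.
    eigvals, eigvecs = eigh(sigma_x, sigma_z)
    order = np.argsort(eigvals)[::-1][:p]
    return eigvecs[:, order], eigvals[order]

def project(x, directions):
    """Project a flattened membership matrix onto the OPCA directions
    (equation 5)."""
    return directions.T @ x
```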

Moreover, each input ink (e.g., handwritten word) becomes a point in the low dimensional space through this process. In the low dimensional space, each input ink, {w_i}, i=1, . . . , M, is stored using its hyperrectangle {y_i, ε_i}, i=1, . . . , M, where y_i and ε_i are vectors of length p that represent the center and side lengths of the hyperrectangle. M is the number of words in the data store. An index component 250, as shown in FIG. 2, can index the hyperrectangles for each input ink using redundant bit vectors.
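
A simple way to obtain such a hyperrectangle {y_i, ε_i} is to take the axis-aligned bounding box of the projected word and its projected distorted copies, as in the following sketch. The optional slack multiplier is an assumed tuning knob, not something specified above.

```python
import numpy as np

def enclosing_hyperrectangle(projected_word, projected_distortions, slack=1.0):
    """Return (center y_i, side lengths eps_i) of the axis-aligned
    hyperrectangle enclosing the projected word and its distorted copies.
    `slack` scales the side lengths and is an assumed tuning parameter."""
    all_points = np.vstack([projected_word[None, :], projected_distortions])
    lo = all_points.min(axis=0)
    hi = all_points.max(axis=0)
    center = (lo + hi) / 2.0
    sides = (hi - lo) * slack
    return center, sides
```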

Redundant bit vectors are very useful for quickly solving spatial object intersection problems that are often encountered in search. For storage purposes, the p-dimensional projections of membership matrices are considered low dimensional. However, for spatial searches, something as low as 30 dimensions is considered large. A large dimension for spatial searches implies that one can replace the hyperspheres typically used for proximity matching with hyperrectangles. Search computations using hyperrectangles are much more amenable to indexing and optimization. In view of this, hyperrectangles rather than hyperspheres can be indexed.

When searching for an ink document or object, the relevant index or data store can be searched. Consider a query word, w_q, with a projected vector y_q ∈ R^p. The search problem can be defined as follows:

-   Search Problem: find all of the words in {w_i} that are similar to w_q. w_q is defined to be similar to w_i if y_q falls within the hyperrectangle defined by {y_i, ε_i}.

This spatial search can be solved by using a brute force linear scan through all stored hyperrectangles and checking for overlap. For large databases, such a linear scan can be very slow. However, the lookup speed can be increased by solving an approximate version of the spatial object intersection problem using redundant bit vectors. The approximate version of the problem allows a small fraction of false positives and false negatives.

In particular, redundant bit vectors use redundancy to group queries together and are particularly suited to high dimensional point queries over hyperrectangles. The redundancy strategy is based on partitioning the query space rather than the data space.

Consider one of the dimensions in the hyperrectangle space. Each of the hyperrectangles spans an interval along this dimension, as shown in FIG. 8 (800). The query space corresponds to the whole x-axis. The query space is partitioned into b intervals, each corresponding to a bin. It should be understood that these intervals do not have to be uniform. Redundant bit vectors treat all queries within one bin identically. This identical treatment does not introduce many false positives, because nearby queries overlap the same rectangles. The redundant bit vectors form an index of the stored hyperrectangles: this index contains pre-computed information about which rectangle overlaps which bin. For each bin, this index is referred to as a “bit-dimension.”

FIG. 8 (800) shows six hyperrectangles (R₁-R₆) along a query dimension that is partitioned into four bins. The first (left-most) and last (right-most) bins extend all the way to −∞ and +∞, respectively. Any query point will fall in one of these intervals. The number of intervals in each dimension, b, was arbitrarily chosen to be 32 in the described implementation (FIG. 8 shows only four bins for illustration). These intervals are typically chosen based on the distribution of the hyperrectangles along that dimension. A heuristic can be used that attempts to evenly spread the selectivity among the b intervals (e.g., over the bit-dimensions). Using this process, each of the p query dimensions is partitioned into b intervals, resulting in pb bit-dimensions.
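
One plausible (assumed) realization of such a selectivity-spreading heuristic is to place the interior bin boundaries at quantiles of the stored hyperrectangle centers along each dimension, as in this sketch; the description above does not prescribe this particular heuristic.

```python
import numpy as np

def bin_boundaries(centers, b=32):
    """Choose b-1 interior boundaries per dimension so the stored
    hyperrectangles spread roughly evenly over the b bins. Quantiles of the
    rectangle centers are an assumed stand-in for the selectivity heuristic
    mentioned above. `centers` has shape (M, p); returns shape (p, b-1).
    The outer bins extend to -inf and +inf, so no outer edges are stored."""
    qs = np.linspace(0.0, 1.0, b + 1)[1:-1]        # interior quantile levels
    return np.quantile(centers, qs, axis=0).T      # shape (p, b-1)
```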

Each hyperrectangle, {y_i, ε_i}, is represented using a vector, B_i, with entries that are either 0 or 1. It is built by concatenating representations for each of the b bins for each of the p query dimensions. Each bit in this vector thus corresponds to one of the pb bit-dimensions. Consider one of the query dimensions as shown in FIG. 8 (800). Each bin along this dimension corresponds to one bit. The hyperrectangle's interval along this dimension overlaps one or more bins. Bits corresponding to overlapping bins are set to 1, while those for non-overlapping bins are set to 0. FIG. 8 (810) shows the bit vectors (as column vectors) for the four bins along a query dimension. The bit vector representation highlights the redundancy that arises from query partitioning. If the bits for a single item are examined across all bit-dimensions for one stored hyperrectangle (rows in FIG. 8 (810)), they form a contiguous run of 1 bits over the bins that the rectangle overlaps, with 0 bits elsewhere.

Thus, the span of the stored hyperrectangle (relative to the bin boundaries) is stored in a redundant binary code. Each interval corresponds to a bit-dimension (columns with boxes around the bits). Only one bit of this binary code (one bit-dimension) need be accessed for each query. These bit-dimensions are helpful in achieving very fast lookups during retrieval. The bit vectors (columns in FIG. 8 (810)) for all bit-dimensions are collected and stored as the columns of a bit matrix, D. The bit matrix has pb columns and M rows. The columns correspond to bit-dimensions and the rows correspond to data store items. A nice consequence of using bit indices to represent the hyperrectangle buckets is that set intersection now becomes a linear bitwise intersection of the associated bit strings (e.g., logically AND all the bit strings together). Note that, for the bit matrix D, the transpose can be stored instead, in which case rows and columns interchange roles.
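
The construction of B_i and the bit matrix D can be sketched as follows. A boolean NumPy array stands in for the packed bit strings that an actual implementation would use, and the bin layout (b-1 interior boundaries per dimension, with the outer bins extending to ±∞) follows the description above.

```python
import numpy as np

def rectangle_bits(center, sides, boundaries):
    """Build the pb-length bit vector B_i for one hyperrectangle {y_i, eps_i}.
    `boundaries` is the (p, b-1) array of interior bin boundaries; bit j of
    dimension d is 1 iff the rectangle's interval overlaps bin j."""
    p, bm1 = boundaries.shape
    b = bm1 + 1
    bits = np.zeros(p * b, dtype=bool)
    lo = center - sides / 2.0
    hi = center + sides / 2.0
    for d in range(p):
        # Bins 0 and b-1 extend to -inf and +inf respectively.
        first = np.searchsorted(boundaries[d], lo[d])
        last = np.searchsorted(boundaries[d], hi[d])
        bits[d * b + first : d * b + last + 1] = True
    return bits

def build_bit_matrix(centers, sides, boundaries):
    """Stack the per-item bit vectors as rows of the bit matrix D
    (M rows, pb columns); a boolean array stands in for packed words."""
    return np.array([rectangle_bits(c, s, boundaries)
                     for c, s in zip(centers, sides)])
```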

Referring now to the ink retrieval system 900 in FIG. 9, looking up a new handwritten word can be accomplished as follows. First, the query ink (e.g., word) is represented by a procedure similar to processing input ink for storage, via an ink processing module 910:

-   a) Normalize and segment the word;
-   b) Classify the segments to obtain the membership matrix;
-   c) Project the membership matrix along the OPCA dimensions to obtain y_q;
-   d) View the projected query point, y_q, as a hyperrectangle, {y_q, ε_q}, of zero volume, e.g., ε_q = 0;
-   e) Compute the pb length binary representation, B_q, for {y_q, ε_q}. Note that only p bits are ON in B_q.

A retrieval component 920 can search for all hyperrectangles in the data store 930 that contain the query point y_q, which is done by ANDing together the columns of the bit matrix, D, that correspond to non-zero entries in B_q. The result is a column bit vector of length M. A quick scan through the column bit vector for non-zero entries produces the query result. Note that these query-time linear bitwise operations are very efficient: given a 32-bit CPU with 4-byte registers, set intersection between two sets for 32 data items is performed with one CPU operation. A 64-bit processor can process set intersections twice as quickly. Furthermore, most of today's 32-bit CPUs support SIMD extensions (such as MMX and SSE2) that natively allow for up to 128-bit processing. These can further speed up bit operations. Also, given the ordered nature of performing set intersection, excellent memory subsystem performance is achieved. Memory accesses become linear in nature and cache misses are relatively rare, leading to very high bandwidth from the memory subsystem. All these properties cumulatively work to produce an algorithm for set intersection that, while linear in the number of data entries, is extremely efficient.
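
A sketch of this lookup is shown below: the query point's bit vector has one ON bit per projection dimension, the corresponding columns of D are ANDed together, and the surviving rows form the query result. The boolean-array representation again stands in for packed machine words.

```python
import numpy as np

def query_bits(y_q, boundaries):
    """Bit vector B_q for a zero-volume query hyperrectangle {y_q, 0}:
    exactly one ON bit (the bin containing y_q) per projection dimension."""
    p, bm1 = boundaries.shape
    b = bm1 + 1
    bits = np.zeros(p * b, dtype=bool)
    for d in range(p):
        j = np.searchsorted(boundaries[d], y_q[d])
        bits[d * b + j] = True
    return bits

def lookup(D, b_q):
    """AND together the columns of the bit matrix D selected by the ON bits
    of B_q and return the indices of surviving data store items. With bit
    packing, this reduces to word-wide AND operations."""
    selected = D[:, b_q]                    # shape (M, p): one column per ON bit
    hits = selected.all(axis=1)             # logical AND across the p columns
    return np.flatnonzero(hits)
```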

The indexing and retrieval process described above takes a handwritten query word and returns a collection of data store items. The returned data store items need to be presented in a sorted order with the most similar entries at the top. The presence or absence of relevant entries in the query result (hit list) determines the recall percentage. Given a data store of items, recall is determined by the choice of the number of OPCA dimensions, p, and the number of bits per dimension, b. One way to increase recall is to include extra neighboring intervals (bit-dimensions) during the query process. This can be achieved by:

-   a) adding one or more neighboring intervals in each dimension and increasing the number of ON bits in B_q (widening the bit vector),
-   b) ORing together the bit dimensions that correspond to the same projection dimension, and then
-   c) ANDing together the resulting bit vectors for each projection dimension.

This widening procedure can be iteratively used to gradually increase the matching distance and generate progressively larger hit lists. On the other hand, precision depends not only on the number of relevant items returned in the query result but also on the total number of items returned. At design time, precision can be improved by selecting more bit-dimensions, such as by choosing a larger b. At runtime, precision can be improved by sorting the entries in the query result using a similarity metric and then truncating the list to return the top-F best matches. The DTW distance between ink samples can be used to determine similarity. Even though DTW is expensive to compute, it can be used for entries in the query result, which is usually a very small percentage of the whole data store. After sorting, the query result list can be truncated to return the top-1, top-10, or top-100 entries, for example. One can also use a maximum threshold on DTW distance as a means of truncating the query list.
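
The widening step can be sketched as follows: extra neighboring bins are turned ON in the query bit vector, the selected bit-dimensions are ORed within each projection dimension, and the per-dimension results are ANDed across dimensions. A widen value of 1 corresponds to the wide-1 configuration discussed in the experiments below; the boolean-array representation is again a stand-in for packed bit strings.

```python
import numpy as np

def widened_query_bits(y_q, boundaries, widen=1):
    """Query bit vector with `widen` extra neighboring bins turned ON in each
    dimension (widen=1 means ORing 3 adjacent bit-dimensions per projection
    dimension before the AND across dimensions)."""
    p, bm1 = boundaries.shape
    b = bm1 + 1
    bits = np.zeros(p * b, dtype=bool)
    for d in range(p):
        j = np.searchsorted(boundaries[d], y_q[d])
        lo, hi = max(0, j - widen), min(b - 1, j + widen)
        bits[d * b + lo : d * b + hi + 1] = True
    return bits

def widened_lookup(D, bits, p, b):
    """OR the selected bit-dimensions within each projection dimension, then
    AND the per-dimension results across all p dimensions."""
    per_dim = np.ones(D.shape[0], dtype=bool)
    for d in range(p):
        cols = D[:, d * b : (d + 1) * b][:, bits[d * b : (d + 1) * b]]
        per_dim &= cols.any(axis=1)          # OR within the dimension
    return np.flatnonzero(per_dim)           # AND across dimensions
```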

The systems described above in FIGS. 1-8 were applied to two handwritten word datasets to evaluate the subject ink storage and retrieval system/method. The first dataset is a widely available standard dataset from the UNIPEN project. The dataset used in the experiments contained all 75,529 isolated cursive and mixed-style handwritten words from category 6. This dataset has 14,002 unique words written by 505 users. The second dataset, referred to as US-Natural, is much larger, with 898,652 handwritten words corresponding to 10,329 unique words written by 567 US users. It is important to note that both of these datasets are orders of magnitude larger than those used in previous studies and conventional techniques for indexing and retrieval of digital ink.

Retrieval performance was tested for a) robustness to distortion, b) generalization over a single user's handwriting, and c) generalization across different users. Given the large data store sizes, query input words and query result sets were simply compared using their word labels and user names. Three different degrees of similarity are possible:

-   a) Most Similar: same word written by the same user;
-   b) Similar: same word written by different users; and
-   c) Dissimilar: different words written by same/different users.

Given that words can be written in print or cursive form, degree options a) and b) from above may be considered somewhat interchangeable in their degree of similarity.

Robustness of retrieval performance under writing distortions was tested using the ink distortion model described above with respect to FIGS. 2 and 6 (see FIG. 6 for sample ink distortions). Each sample in the dataset was distorted and used for querying. The noise settings were the same as before, but the distorted samples for testing were different from the ones used for indexing. The rank of the undistorted source sample was used to determine retrieval robustness. Low ranks imply robust indexing. The same distortion settings were used for both datasets.

Every sample in the dataset has a word label and a user label. One of the design criteria for retrieval was that ink retrieval should find not only those samples with very close shape matching, but also should be able to generalize to the same word written by other users. Good generalization performance would indicate that retrieved results would not only contain similar words from the same user but also from other users. Each of the datasets was split into two sets based on their (word, user) sample frequencies as follows:

-   a) First, all (word, user) pairs that occurred only once were dropped from the dataset. This can result in a significant loss in the number of remaining samples (see below).
-   b) Second, each of the remaining (word, user) sample pairs was split equally between the two sets.

The resulting two sets are of nearly the same size with a near equal distribution of (word, user) samples.

The UNIPEN dataset has very few (word, user) repeats. As a result, over 90% of the samples could not be used for the generalization tests. The UNIPEN dataset produced two sets with 3,076 and 2,553 samples. The US-Natural dataset resulted in two sets with 515,388 and 383,264 samples. The first set was indexed to build the data store, while entries from the second set were used as query words to determine generalization performance. As in the distortion experiments, the rank of the first query result with the same (word, user) label was used as a measure of generalization performance. Low ranks imply good generalization.

For each of the experiments, three sets of results are presented, namely default, wide-1, and brute force DTW. The retrieval results for brute force DTW are obtained using a linear scan through the data store using DTW distance as the similarity measure. The DTW results present an upper bound on the best possible results if computation time was not a limitation. The default retrieval uses one bit per projection dimension, while the wide-1 results are for a widening by one bit: that is, ORing 3 adjacent bit-dimensions together before ANDing across each of the p dimensions. Furthermore, on each plot, retrieval results (ranks) are presented against ink samples from the same user (solid lines) and all users (dashed lines).

FIG. 10 presents the distortion results for the US-Natural dataset as a function of the top 100 ranks. In 89% of the queries, the default lookup produces the correct match at rank=1. Widening by 1 bit improves this to 92%. The widened version is 3× slower while processing the bit matrix and on average returned 15% more samples. The DTW results indicate that the best achievable performance is about 95%. Both the 89% and 92% results indicate retrieval performance that is robust to input variations. A 2-3% improvement is achieved by extending the match window to the top 10 or 100 ranks.

FIG. 11 presents the corresponding distortion results for the UNIPEN dataset. Retrieval rates under distortion are lower for the UNIPEN dataset, with rank=1 accuracies being 63%, 66%, and 75% for default, wide-1, and DTW, respectively. The DTW results themselves being lower indicates that multiple instances of the same word in the UNIPEN dataset have larger variations than those in the US-Natural dataset. For both datasets, as would be expected, better results are obtained when matching is relaxed to include not only samples from the same user but all users in the data store. This is clearly indicated by the “All User” retrieval percentages being higher (for the same rank) than the “Same User” percentages.

FIGS. 12 and 13 present the generalization results for the US-Natural and UNIPEN datasets, respectively. Retrieval rates at rank=1 on the US-Natural dataset are 65%, 75%, and 81% for default, wide-1, and DTW, respectively. The corresponding percentages for the UNIPEN dataset are 69%, 76%, and 89% for default, wide-1, and DTW, respectively. Retrieval percentages grow by 4% for US-Natural and 7% for UNIPEN when the top 10 retrieved ink words are examined. As with the distortion results, the generalization results are better when matching is relaxed to include samples from all users.

Various methodologies will now be described via a series of acts. It is to be understood and appreciated that the subject system and/or methodology is not limited by the order of acts, as some acts may, in accordance with the subject application, occur in different orders and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the subject application.

Turning now to FIG. 14, there is a flow diagram of an exemplary method 1400 that facilitates indexing electronic ink using redundant bit vectors, thereby improving the speed, accuracy, and overall efficiency of ink retrieval. The method 1400 involves featurizing the input ink in order to create a membership matrix for the input ink (e.g., handwritten word) at 1410. The input ink can be normalized and segmented prior to the featurization. Thus, each segment can be featurized and classified by a neural network.

Following this, the membership matrix (for each input ink or ink word) can be projected into a low dimensional space at 1420 according to selected projection directions. OPCA can be employed to find the projection directions for the particular membership matrix that maximize signal variance and minimize noise variance. The low dimensional space can be defined by at least a subset of OPCA projection directions. The membership matrix (or matrices) can be converted to at least one vector and enclosed by a hyperrectangle. The hyperrectangle for each input ink can be stored using redundant bit vectors in one or more data stores at 1430. These redundant bit vectors can be stored in RAM or on disk, as is known in the art. The bit vectors can also be stored in one (or more) database(s), as long as the database can perform the steps in FIG. 16, below. Those steps can be executed, for example, with stored procedures. Alternatively, the bit vectors can be permanently stored in a database but operated on in RAM.

The flow diagram in FIG. 15 presents some additional detail regarding a method 1500 that facilitates ink indexing for efficient and accurate retrieval. In particular, the method 1500 involves normalizing input ink at 1510. The normalization process can include smoothing, centering, and scaling the input ink to a unit height while maintaining the aspect ratio. Some or all aspects of the normalization process may not be needed since many pen-based or ink-based device drivers automatically smooth and sub-sample pen input points to remove quantization and other spurious noise.

At 1520, the input ink can be cut into segments and featurized. More specifically, a cut point is determined where the y coordinate reaches its minimum value and begins to increase. As discussed earlier, each segment can be represented in the form of a Chebyshev polynomial. The coefficients of the truncated polynomials can become the segment features, which are invertible and can contribute to a reconstruction of approximate versions of the source segments. At 1530, the featurized segments can be classified using a neural network, which can be used to generate a membership matrix for the input ink. Segment classification can significantly reduce or mitigate entropy in stroke features, thus leading to improved retrieval accuracy.

Handwriting has a greater amount of variation compared to other multimedia data. This is even true when dealing with the same word written multiple times by the same user, whether in print or in cursive. Using an ink distortion model, a plurality of distorted versions of the input ink can be created and used during indexing. In order to make the subject system and method robust to such variations, low dimensional representations of the input ink can be built. This can be accomplished in part by mapping the membership matrix for the input ink to a low dimensional space. In particular, the membership matrix can be converted into a vector, represented as an enclosing hyperrectangle, and projected into the low dimensional space according to at least a subset of OPCA projection directions (at 1540). Membership matrices for the distorted versions of the input ink can be mapped to the low dimensional space as well. At 1550, the hyperrectangles for each input ink can be indexed in a data store using redundant bit vectors, which facilitates speeding up the query look-up process.

Referring now to FIG. 16, there is a flow diagram of an exemplary ink retrieval method 1600. The method 1600 involves receiving a query ink at 1610. For example, imagine that a user has stored multiple ink documents relating to notes, reports, and project updates on his ink-based computing device (e.g., tablet PC, PDA, desktop computer, mini laptop or PC, etc.). To access at least one stored ink document concerning a mobility study, the user enters a query (in ink as well) for mobility (query ink). The query ink can be processed at 1620 in a manner similar to the ink indexing scheme in order to obtain a projected query point (y_q). Recall that the query point can be viewed or represented as a hyperrectangle (query hyperrectangle). At 1630, the bit-dimension length vector for the query hyperrectangle can be computed. That is, the vector is of length pb, the number of bit-dimensions that were pre-computed. Thereafter, the data store can be searched in order to find at least one stored hyperrectangle that contains the query point. The query result can be obtained by scanning through the resulting bit vector of length M for non-zero entries at 1650. Thus, any matches found in the data store can be presented to the user. For example, the user may be presented with a list of ink documents that all contain at least one instance of the ink word mobility.

Using bit vectors makes the computations very efficient. This is because the data store lookup operations boil down to ANDing and ORing operations on large contiguous blocks of bits. Such operations can be very efficiently performed on the ever-advancing digital computing device.

Moreover, the subject systems and methods facilitate indexing and retrieval of handwritten words (cursive or print) obtained using an electronic writing device. The handwritten words are first mapped to a low dimension through a process of segmentation, stroke classification using a neural network, and projection along directions found using oriented principal component analysis (OPCA). Using OPCA makes these low dimensional representations robust to handwriting variations (noise). Each handwritten word is stored along with a neighboring hyperrectangle that represents word variations. Redundant bit vectors are used to index the resulting hyperrectangles for efficient storage and retrieval, as they are ideal for quickly performing lookups based on point and hyperrectangle hit-tests. Furthermore, the recall and precision values obtained can be varied by changing the number of projections (p) used and the number of quantized levels along each projection (b).

In order to provide additional context for various aspects of the subject invention, FIG. 17 and the following discussion are intended to provide a brief, general description of a suitable operating environment 1710 in which various aspects of the subject invention may be implemented. In particular, the subject indexing and retrieval systems and methods can exist or operate on both portable and non-portable devices that support ink-based input (via pen, stylus, or other inking input device).

While the invention is described in the general context of computer-executable instructions, such as different modules or components, executed by one or more computers or other devices, those skilled in the art will recognize that the invention can also be implemented in combination with other program modules and/or as a combination of hardware and software.

Generally, however, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular data types. The operating environment 1710 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Other well known computer systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include the above systems or devices, and the like.

With reference to FIG. 17, an exemplary environment 1710 for implementing various aspects of the invention includes a computer 1712. The computer 1712 includes a processing unit 1714, a system memory 1716, and a system bus 1718. The system bus 1718 couples system components including, but not limited to, the system memory 1716 to the processing unit 1714. The processing unit 1714 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 1714.

The system bus 1718 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 8-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).

The system memory 1716 includes volatile memory 1720 and nonvolatile memory 1722. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 1712, such as during start-up, is stored in nonvolatile memory 1722. By way of illustration, and not limitation, nonvolatile memory 1722 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 1720 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

Computer 1712 also includes removable/nonremovable, volatile/nonvolatile computer storage media. FIG. 17 illustrates, for example, a disk storage 1724. Disk storage 1724 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. In addition, disk storage 1724 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 1724 to the system bus 1718, a removable or non-removable interface is typically used, such as interface 1726.

It is to be appreciated that FIG. 17 describes software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 1710. Such software includes an operating system 1728. Operating system 1728, which can be stored on disk storage 1724, acts to control and allocate resources of the computer system 1712. System applications 1730 take advantage of the management of resources by operating system 1728 through program modules 1732 and program data 1734 stored either in system memory 1716 or on disk storage 1724. It is to be appreciated that the subject invention can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 1712 through input device(s) 1736. Input devices 1736 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 1714 through the system bus 1718 via interface port(s) 1738. Interface port(s) 1738 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 1740 use some of the same type of ports as input device(s) 1736. Thus, for example, a USB port may be used to provide input to computer 1712 and to output information from computer 1712 to an output device 1740. Output adapter 1742 is provided to illustrate that there are some output devices 1740 like monitors, speakers, and printers among other output devices 1740 that require special adapters. The output adapters 1742 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 1740 and the system bus 1718. It should be noted that other devices and/or systems of devices provide both input and output capabilities, such as remote computer(s) 1744.

Computer 1712 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1744. The remote computer(s) 1744 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 1712. For purposes of brevity, only a memory storage device 1746 is illustrated with remote computer(s) 1744. Remote computer(s) 1744 is logically connected to computer 1712 through a network interface 1748 and then physically connected via communication connection 1750. Network interface 1748 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

Communication connection(s) 1750 refers to the hardware/software employed to connect the network interface 1748 to the bus 1718. While communication connection 1750 is shown for illustrative clarity inside computer 1712, it can also be external to computer 1712. The hardware/software necessary for connection to the network interface 1748 includes, for exemplary purposes only, internal and external technologies such as modems, including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.

What has been described above includes examples of the subject system and/or method. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the subject system and/or method, but one of ordinary skill in the art may recognize that many further combinations and permutations of the subject system and/or method are possible. Accordingly, the subject system and/or method are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.

1. A system that facilitates indexing and retrieving electronic handwritten ink characters comprising: an ink processing module to: featurize input ink to yield a membership matrix for each input ink, project at least the membership matrix for each input ink into a low dimensional space, and determine a projected query point for the query ink, wherein the query point is represented as a query hyperrectangle; and a retrieval module to: search a data store for hyperrectangles that intersect the query point to produce query results, and widen the query results by adding one or more neighboring intervals in each dimension and increasing a number of ON bits in a binary representation of the query hyperrectangle.
2. The system of claim 1, wherein the input ink comprises at least one of the following: a word, one or more characters, or one or more symbols.
3. The system of claim 1 further comprises a normalization component that smoothes and scales the input ink to a unit height while maintaining its aspect ratio to make retrieval of the input ink indifferent to translation and scale variations.
4. The system of claim 1 further comprises a segmentation component that segments each input ink into a plurality of segments by culling the input ink where a y-coordinate value of the input ink reaches a minimum value and starts to increase.
5. The system of claim 1, wherein the ink processing module projects the membership matrix for each input ink into the low dimensional space in a plurality of projection directions defined by at least one of the following: oriented principal component analysis (OPCA) and multiple discriminant analysis (MDA).
6. The system of claim 1, wherein the ink processing module maps the segmented input ink to the membership matrix using a neural network before projecting it into the low dimensional space.
7. The system of claim 1, wherein the ink processing module creates low dimensional representations of the membership matrix for a given input ink to facilitate ink retrieval that is robust to user and instance variations.
8. The system of claim 1, wherein the ink processing module computes a binary representation for the query hyperrectangle.
9. The system of claim 1, wherein the retrieval component widens the query results by ORing together bit dimensions that correspond to a same projection dimension.
10. The system of claim 1, wherein the retrieval component widens the query results by: ORing together bit dimensions that correspond to a same projection dimension; and ANDing together any resulting bit vectors for each projection dimension.
11. The system of claim 1, wherein the retrieval component optimizes precision of the query results by ranking the query results using a similarity metric and truncating the query results to return the top-F best matches.
12. An ink indexing and retrieval method that facilitates retrieving electronic handwritten ink objects comprising: receiving an input ink; featurizing the input ink to yield a membership matrix for each input ink; projecting the membership matrix, via a processor component, for each input ink into a low dimensional space; determining a projected query point for the query ink, wherein the query point is represented as a query hyperrectangle; computing a binary representation for the query hyperrectangle; indexing each input ink using redundant bit vectors to build a data store; searching the data store for hyperrectangles that intersect the query point to produce query results; widening the query results comprising: adding one or more neighboring intervals in each dimension and increasing a number of ON bits in the computed binary representation for the query hyperrectangle to widen a bit vector, ORing together bit dimensions that correspond to a same projection dimension, and ANDing together any resulting bit vectors for each projection dimension.
13. The method of claim 12, wherein featurizing the input ink comprises: splitting the input ink into a plurality of segments; determining segment features based on the plurality of segments; and classifying the stroke features to mitigate entropy and variance to create a membership matrix for the input ink.
14. The method of claim 13 further comprises creating a plurality of distorted versions of the input ink and creating membership matrices for the distorted versions.
15. The method of claim 14 further comprises mapping the membership matrices for the distorted versions of the input ink onto the low dimensional space.
16. The method of claim 12, wherein the low dimensional space is defined by at least a subset of projection directions.
17. The method of claim 16 further comprises using at least one of OPCA and MDA to determine the projection directions.
18. A system that facilitates indexing and retrieving electronic handwritten ink characters comprising: a processor; and a memory into which a plurality of computer-executable instructions are loaded, the plurality of instructions performing a method comprising: featurizing input ink to yield a membership matrix for each input ink; projecting at least the membership matrix for each input ink into a low dimensional space; indexing each input ink using redundant bit vectors; processing a query ink to yield a query hyperrectangle and searching one or more bit vector indices for hyperrectangles that at least closely match the query ink; and widening the searching by increasing a number of ON bits in a binary representation of the query hyperrectangle.